In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
translated by 谷歌翻译
Graph neural networks (GNNs) have received remarkable success in link prediction (GNNLP) tasks. Existing efforts first predefine the subgraph for the whole dataset and then apply GNNs to encode edge representations by leveraging the neighborhood structure induced by the fixed subgraph. The prominence of GNNLP methods significantly relies on the adhoc subgraph. Since node connectivity in real-world graphs is complex, one shared subgraph is limited for all edges. Thus, the choices of subgraphs should be personalized to different edges. However, performing personalized subgraph selection is nontrivial since the potential selection space grows exponentially to the scale of edges. Besides, the inference edges are not available during training in link prediction scenarios, so the selection process needs to be inductive. To bridge the gap, we introduce a Personalized Subgraph Selector (PS2) as a plug-and-play framework to automatically, personally, and inductively identify optimal subgraphs for different edges when performing GNNLP. PS2 is instantiated as a bi-level optimization problem that can be efficiently solved differently. Coupling GNNLP models with PS2, we suggest a brand-new angle towards GNNLP training: by first identifying the optimal subgraphs for edges; and then focusing on training the inference model by using the sampled subgraphs. Comprehensive experiments endorse the effectiveness of our proposed method across various GNNLP backbones (GCN, GraphSage, NGCF, LightGCN, and SEAL) and diverse benchmarks (Planetoid, OGB, and Recommendation datasets). Our code is publicly available at \url{https://github.com/qiaoyu-tan/PS2}
translated by 谷歌翻译
Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al.,2022)--from 125M to 175B parameters--on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotation) can achieve the equivalent performance of its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there is a point-labeled dataset where saliency models trained on it can achieve equivalent performance when trained on the densely annotated dataset. To prove this conjecture, we proposed a novel yet effective adversarial trajectory-ensemble active learning (ATAL). Our contributions are three-fold: 1) Our proposed adversarial attack triggering uncertainty can conquer the overconfidence of existing active learning methods and accurately locate these uncertain pixels. {2)} Our proposed trajectory-ensemble uncertainty estimation method maintains the advantages of the ensemble networks while significantly reducing the computational cost. {3)} Our proposed relationship-aware diversity sampling algorithm can conquer oversampling while boosting performance. Experimental results show that our ATAL can find such a point-labeled dataset, where a saliency model trained on it obtained $97\%$ -- $99\%$ performance of its fully-supervised version with only ten annotated points per image.
translated by 谷歌翻译
Federated learning (FL) has been recognized as a privacy-preserving distributed machine learning paradigm that enables knowledge sharing among various heterogeneous artificial intelligence (AIoT) devices through centralized global model aggregation. FL suffers from model inaccuracy and slow convergence due to the model heterogeneity of the AIoT devices involved. Although various existing methods try to solve the bottleneck of the model heterogeneity problem, most of them improve the accuracy of heterogeneous models in a coarse-grained manner, which makes it still a great challenge to deploy large-scale AIoT devices. To alleviate the negative impact of this problem and take full advantage of the diversity of each heterogeneous model, we propose an efficient framework named HierarchyFL, which uses a small amount of public data for efficient and scalable knowledge across a variety of differently structured models. By using self-distillation and our proposed ensemble library, each hierarchical model can intelligently learn from each other on cloud servers. Experimental results on various well-known datasets show that HierarchyFL can not only maximize the knowledge sharing among various heterogeneous models in large-scale AIoT systems, but also greatly improve the model performance of each involved heterogeneous AIoT device.
translated by 谷歌翻译
We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words that are introduced to help represent what NL words hardly describe, allowing a prompt to be more descriptive. Like NL prompts, X-Prompt is out-of-distribution (OOD) robust, for which we propose context-guided learning with prompt augmentation to learn its imaginary words for general usability, enabling them to use in different prompt contexts for fine-grain specifications. The promising results of X-Prompt demonstrate its potential of approaching advanced interaction between humans and LLMs to bridge their communication gap.
translated by 谷歌翻译
非视线(NLOS)成像是一种用于检测障碍物或角落周围物体的物体的新兴技术。关于被动NLOS的最新研究主要集中在稳态测量和重建方法上,这些方法显示出识别移动目标的局限性。据我们所知,我们提出了一种新颖的基于事件的无源NLOS成像方法。我们获得了基于事件的异步数据,其中包含NLOS目标的详细动态信息,并有效缓解由运动引起的斑点降解。此外,我们创建了第一个基于事件的NLOS成像数据集NLOS-ES,并且由时间表面表示提取基于事件的功能。我们通过基于事件的数据与基于框架的数据比较重建。基于事件的方法在PSNR和LPIP上表现良好,该方法比基于框架的方法好20%和10%,而数据量仅占传统方法的2%。
translated by 谷歌翻译
基于深度学习的计算机辅助诊断(CAD)在学术研究和临床应用中引起了吸引人的关注。然而,卷积神经网络(CNN)诊断系统严重依赖于标记的病变数据集,对数据分布变化的敏感性也限制了CNN在CAD中的潜在应用。开发了无监督的域适应性(UDA)方法来解决昂贵的注释和域间隙问题,并在医学图像分析中取得了巨大的成功。然而,现有的UDA方法仅适应从源病变域中汲取的知识到一个单个目标病变域,这是针对临床情况的:要诊断的新的未标记的目标域始终以在线和连续的方式到达。此外,由于新知识的知识覆盖了先前学到的知识(即灾难性的遗忘),因此现有方法的性能在先前学到的目标病变域上大大降低。为了处理上述问题,我们开发了一个名为连续病变知识元适应(CLKM)的元适应框架,该框架主要由语义适应阶段(​​SAP)和表示适应阶段(​​RAP)组成,以在线学习诊断模型和连续的方式。在SAP中,从源病变域中学到的语义知识转移到连续的靶病变域。在RAP中,优化了功能提取器以对齐整个源和多个目标病变域的可转移表示知识。
translated by 谷歌翻译
部分标签学习(PLL)是一项奇特的弱监督学习任务,其中训练样本通常与一组候选标签而不是单个地面真理相关联。尽管在该域中提出了各种标签歧义方法,但他们通常假设在许多现实世界应用中可能不存在类平衡的方案。从经验上讲,我们在面对长尾分布和部分标记的组合挑战时观察到了先前方法的退化性能。在这项工作中,我们首先确定先前工作失败的主要原因。随后,我们提出了一种新型的基于最佳运输的框架太阳能,它允许完善被歧义的标签,以匹配边缘级别的先验分布。太阳能还结合了一种新的系统机制,用于估计PLL设置下的长尾类先验分布。通过广泛的实验,与先前的最先进的PLL方法相比,太阳能在标准化基准方面表现出基本优势。代码和数据可在以下网址获得:https://github.com/hbzju/solar。
translated by 谷歌翻译